This report investigates the performance of different aggregation
methods for forecasting competition assessment, using the RCT-A dataset
from the HFC competition. I evaluated five aggregation methods and
propose an improvement based on the best-performing method.
The dataset was analysed using the
data.table R package, which allows fast
and memory-efficient handling of data.
The first-year competition data comes in three main datasets:
The rct-a-questions-answers.csv dataset
contains metadata on the questions, such as dates, tags, and
descriptions. Variables that are important to this assignment are: the
discover IDs for the questions and answers (for joining datasets),
and the resolved probabilities for the answers (i.e. the encoding of the
true outcome).
The rct-a-daily-forecasts.csv dataset
contains daily forecasts for each performer forecasting method, along
with indexes that allow joining this dataset with the other crucial
datasets. Variables that are important to this assignment are: date,
the discover IDs for the questions and answers, the external prediction
set ID (i.e. the ID that is common to a predictor assigning
probabilities to a set of possible answers), and the forecast value
itself.
The rct-a-prediction-sets.csv dataset contains
information on prediction sets, along with basic question and answer
metadata, forecasted and final probability values, and indexes
that allow joining this dataset with the other datasets. This dataset
appears to be redundant, as the important information can be found in the
first two datasets.
To reduce the size of the datasets, only the relevant columns of
rct-a-questions-answers.csv and
rct-a-daily-forecasts.csv were selected.
These were:
From rct-a-daily-forecasts.csv: date, discover question id, discover answer id, forecast, created at, external prediction set id.
From rct-a-questions-answers.csv: discover question id, discover answer id, answer resolved probability.
The variables of interest were assessed for the presence of missing
values, and these were subsequently removed. Lastly, only the most
recent predictions per predictor per day were included in the analysis
(although it seems that the
rct-a-daily-forecasts.csv dataset already
contained only a single prediction per predictor per day).
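The column selection, missing-value removal, and most-recent-prediction filtering described above were performed with data.table in R. As a minimal illustrative sketch of the same steps, here is a Python/pandas version; the small in-memory data frame and its column names (mirroring those listed above) are hypothetical stand-ins for the CSV files:

```python
import pandas as pd

# Hypothetical rows standing in for rct-a-daily-forecasts.csv.
forecasts = pd.DataFrame({
    "date": ["2018-03-07", "2018-03-07", "2018-03-07"],
    "discover question id": [1, 1, 1],
    "discover answer id": [10, 10, 10],
    "external prediction set id": [100, 100, 200],
    "created at": ["2018-03-07 08:00", "2018-03-07 12:00", "2018-03-07 09:00"],
    "forecast": [0.40, 0.55, None],
})

# Remove rows with missing forecast values.
forecasts = forecasts.dropna(subset=["forecast"])

# Keep only the most recent prediction per predictor per question-day.
latest = (forecasts
          .sort_values("created at")
          .groupby(["date", "discover question id", "discover answer id",
                    "external prediction set id"], as_index=False)
          .last())
```

Here predictor 200's row is dropped for its missing forecast, and predictor 100's two daily submissions collapse to the most recent one.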
I aggregated the individual forecasts for each question-day pair using five different methods: the arithmetic mean, the trimmed arithmetic mean, the median, the geometric mean, and the geometric mean of odds.
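The five aggregation methods can be sketched as follows. This is a minimal Python illustration, not the original R code; the clipping constant and the 10% trimming fraction are assumptions introduced here to keep logs and odds finite and to make the trimmed mean concrete:

```python
import statistics
from math import exp, log

def clip(p, eps=1e-6):
    # Guard against exact 0/1 probabilities before taking logs or odds.
    return min(max(p, eps), 1 - eps)

def arithmetic_mean(ps):
    return sum(ps) / len(ps)

def trimmed_mean(ps, trim=0.1):
    # Drop the lowest and highest `trim` fraction of forecasts.
    ps = sorted(ps)
    k = int(len(ps) * trim)
    kept = ps[k:len(ps) - k] or ps
    return sum(kept) / len(kept)

def median(ps):
    return statistics.median(ps)

def geometric_mean(ps):
    ps = [clip(p) for p in ps]
    return exp(sum(log(p) for p in ps) / len(ps))

def geometric_mean_of_odds(ps):
    odds = [clip(p) / (1 - clip(p)) for p in ps]
    gmo = exp(sum(log(o) for o in odds) / len(odds))
    return gmo / (1 + gmo)  # convert the pooled odds back to a probability
```

Note that the geometric mean of odds is converted back to the probability scale so that all five aggregates are directly comparable.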
To evaluate the accuracy of each aggregation method, I computed the Brier score, which measures the mean squared error between the aggregated forecast and the actual outcome.
The Brier score is a measure of how close the predicted probabilities are to the actual outcomes. It is defined as the mean squared error between the predicted probabilities \(\hat{p}_i\) and the known outcomes \(y_i\), given by the formula:
\[ \text{Brier Score} = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{p}_i \right)^2 \]
where \(n\) is the number of forecasts, \(y_i \in \{0, 1\}\) is the resolved outcome, and \(\hat{p}_i\) is the aggregated forecast probability.
The Brier score ranges from 0 to 1, where lower values indicate better predictive capabilities.
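The score can be computed directly from its definition; a minimal Python sketch, assuming outcomes encoded as 0/1:

```python
def brier_score(p_hat, y):
    # Mean squared error between forecast probabilities and 0/1 outcomes.
    assert len(p_hat) == len(y)
    return sum((yi - pi) ** 2 for pi, yi in zip(p_hat, y)) / len(y)
```

A perfectly confident correct forecast scores 0, while a maximally uncertain forecast of 0.5 scores 0.25 regardless of the outcome.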
The following table shows the Brier scores for each
question-day pair per aggregation method used. The final two columns,
Best_Method and
Ranked_Methods, show the best performing
method (i.e. the method with the lowest Brier score) and
the order of the method performance, respectively:
The best performing aggregation method was the geometric mean (47.79% of
prediction-day pairs (PDPs)), followed by the geometric mean of odds
(31.42% of PDPs), the median (10.54% of PDPs), and the arithmetic mean
(10.25% of PDPs). The trimmed arithmetic mean never outperformed the
other methods. These data suggest that methods that ignore information
from extreme predictions (such as the median, mean, and trimmed mean) fail
to capture the true information in the aggregate prediction. The geometric
mean and the geometric mean of odds appear to compete for the best
prediction method, likely depending on the nuances of the structure of the
question and its possible answers. These data therefore suggest that the
nature of the question would dictate which aggregation method most
properly assesses the aggregate performance of the predictors.
I propose an improvement to the geometric mean of odds by extremising the odds to penalise under-confidence in forecasters.
The extremised geometric mean of odds is calculated in the following steps:
\[ \text{Odds}(p_i) = \frac{p_i}{1 - p_i} \]
\[ \text{Geometric Mean of Odds} = \exp\left(\frac{1}{n} \sum_{i=1}^{n} \log\left(\text{Odds}(p_i)\right)\right) \]
\[ \text{Extremised Odds} = \left( \text{Geometric Mean of Odds} \right)^{2.5} \]
\[ p_{\text{extremised}} = \frac{\text{Extremised Odds}}{1 + \text{Extremised Odds}} \]
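The four steps above can be sketched in a single function; a minimal Python illustration (the clipping constant is an assumption introduced here to keep the odds finite, and the exponent defaults to the extremising parameter 2.5 used in this report):

```python
from math import exp, log

def extremised_gmo(ps, a=2.5, eps=1e-6):
    # Step 1: convert each probability to odds (clipping away exact 0/1).
    ps = [min(max(p, eps), 1 - eps) for p in ps]
    odds = [p / (1 - p) for p in ps]
    # Step 2: geometric mean of the odds.
    gmo = exp(sum(log(o) for o in odds) / len(odds))
    # Step 3: extremise by raising to the power of the parameter a.
    ext = gmo ** a
    # Step 4: convert the extremised odds back to a probability.
    return ext / (1 + ext)
```

With a = 1 the function reduces to the plain geometric mean of odds; with a > 1 the pooled probability is pushed away from 0.5, which is the intended correction for under-confidence.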
Prior to assessing the improved, best-performing method, the
rct-a-daily-forecasts.csv dataset was
filtered to include only the data from the first day of the
competition.
The following table again shows the Brier scores per question-day pair;
the final two columns, Best_Method and
Ranked_Methods, show the best performing
method (i.e. the method with the lowest Brier score) and
the order of the method performance, respectively:
The best performing aggregation method was the extremised geometric mean
of odds (42.86% of PDPs), followed by the arithmetic mean (28.57% of
PDPs), the median (19.05% of PDPs), the geometric mean (4.76% of PDPs), and
the geometric mean of odds (4.76% of PDPs). The trimmed arithmetic mean
never outperformed the other methods. Evidently, the extremised
geometric mean of odds outperformed the other methods and was thus a
clear improvement in the prediction evaluation. The working principle
behind it is a modification of the geometric mean of odds, in which the
geometric mean of odds is raised to the power of an extremising
parameter, in this case equal to 2.5. This method is a correction for
forecaster under-confidence. In the present dataset it was able to
outcompete the other methods; however, applying it to a different
dataset containing less forecaster under-confidence would likely make
it non-optimal.
The extremised geometric mean of odds provided the best aggregation performance, suggesting that penalising under-confident predictions can improve forecasting accuracy. However, the effectiveness of this method may vary depending on the dataset’s structure and the forecasters’ behaviour.